Reasoning about Regular Properties: A Comparative Study

Abstract. Several new algorithms for deciding emptiness of Boolean combinations of regular languages and of languages of alternating automata have been proposed recently, especially in the context of analysing regular expressions and in string constraint solving. The new algorithms have demonstrated significant potential, but they have never been systematically compared, neither among each other nor with the state-of-the-art implementations of existing (non)deterministic automata-based methods. In this paper, we provide such a comparison as well as an overview of the existing algorithms and their implementations. We collect a diverse benchmark, mostly originating in or related to practical problems from string constraint solving, analysing LTL properties, and regular model checking, and evaluate the collected implementations on it. The results reveal the best tools and hint at what the best algorithms and implementation techniques are. Roughly, although some advanced algorithms are fast, such as antichain algorithms and reductions to IC3/PDR, they are not as overwhelmingly dominant as sometimes presented, and there is no clear winner. The simplest NFA-based technology may sometimes be a better choice, depending on the problem source and the implementation style. We believe that our findings are relevant for the development of automata techniques as well as for related fields such as string constraint solving.

Updates and News

Links to resources

Summary of Tools and Benchmarks

In the following, we provide a brief list of the benchmarks and tools used. For a full description, see the paper, the replication package, or our benchmark repository.

Tools

Boolean combinations of regular expressions (BRE) benchmark

This group of BRE benchmarks contains instances on which we can run all tools, including those based on NFAs and DFAs. They have small to moderate numbers of minterms (about 30 on average, at most slightly over a hundred).
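To make concrete what such a BRE instance involves, the following toy sketch (our own illustration, not one of the evaluated tools) decides emptiness of L(A) ∩ co-L(B) using the classical explicit DFA technology: complementation, product construction, and a reachability check. The automata and the alphabet here are made up for the example.

```python
# Toy sketch of the classical DFA-based approach to a BRE instance:
# decide emptiness of L(A) ∩ complement(L(B)).
from collections import deque

class DFA:
    def __init__(self, states, alphabet, delta, init, final):
        self.states, self.alphabet = states, alphabet
        self.delta, self.init, self.final = delta, init, final

    def complement(self):
        # Assumes the DFA is complete (total transition function).
        return DFA(self.states, self.alphabet, self.delta,
                   self.init, self.states - self.final)

def intersect(a, b):
    # Standard product construction (assumes the same alphabet).
    states = {(p, q) for p in a.states for q in b.states}
    delta = {((p, q), s): (a.delta[p, s], b.delta[q, s])
             for p in a.states for q in b.states for s in a.alphabet}
    final = {(p, q) for p in a.final for q in b.final}
    return DFA(states, a.alphabet, delta, (a.init, b.init), final)

def is_empty(dfa):
    # BFS reachability of an accepting state.
    seen, queue = {dfa.init}, deque([dfa.init])
    while queue:
        p = queue.popleft()
        if p in dfa.final:
            return False
        for s in dfa.alphabet:
            q = dfa.delta[p, s]
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return True

# A accepts words containing 'a'; B accepts all words,
# so L(A) ∩ complement(L(B)) is empty.
sigma = {'a', 'b'}
A = DFA({0, 1}, sigma,
        {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 1}, 0, {1})
B = DFA({0}, sigma, {(0, 'a'): 0, (0, 'b'): 0}, 0, {0})
print(is_empty(intersect(A, B.complement())))  # True
```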

AFA Benchmark

The second group of examples contains AFA that are not easily convertible to BRE. Here we can run only the tools that handle general AFA emptiness. Some of these benchmarks also have large sets of minterms (easily reaching thousands) and complex formulae in the AFA transition function, hence converting them to restricted forms such as separated DNF or an explicit representation may be very costly.
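As a rough illustration of what AFA emptiness checking involves, the sketch below (again our own toy code, not an evaluated tool) assumes the transition function is already given in DNF, i.e., delta(q, a) is a set of frozensets of successor states, each frozenset read as a conjunction; as noted above, obtaining such a form can itself be costly. Emptiness is checked by a forward exploration of conjunctive macro-states.

```python
# Toy sketch of forward AFA emptiness checking. A macro-state is a set of
# AFA states read conjunctively; it is accepting if all its states are final.
from itertools import product
from collections import deque

def afa_is_empty(initial, final, alphabet, delta):
    start = frozenset(initial)            # conjunction of initial states
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        if s <= final:                    # all obligations are accepting
            return False
        for a in alphabet:
            # Pick one disjunct of delta(q, a) for each q in s and union them.
            for choice in product(*(delta[q, a] for q in s)):
                succ = frozenset().union(*choice)
                if succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
    return True

# Toy AFA over {'a'}: state 0 requires both 1 and 2 afterwards.
delta = {
    (0, 'a'): {frozenset({1, 2})},
    (1, 'a'): {frozenset({1})},
    (2, 'a'): {frozenset({2})},
}
print(afa_is_empty({0}, frozenset({1, 2}), {'a'}, delta))  # False (non-empty)
```

An antichain optimization would additionally discard a macro-state whenever a subset of it has already been explored.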

Key results

In the following, we list our key results, summarized in tables and figures. For a larger vector version of an image, click on its preview.

Summary of benchmarks

Summary of the AFA and BRE benchmarks. The table lists (i) the average runtime, (ii) the median runtime, and (iii) the number of timeouts and errors (in brackets). Winners are highlighted in bold.

Cumulative runtime on benchmarks

Cactus plots of the AFA and BRE benchmarks. The y-axis shows the cumulative time taken on the benchmarks on a logarithmic scale; the benchmarks on the x-axis are ordered by the runtime of each tool.
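For reference, a cactus plot of this kind can be produced with a few lines of matplotlib; the script below is only an illustration with randomly generated runtimes and hypothetical tool names, not the one used to produce our figures.

```python
# Illustrative cactus plot: for each tool, sort its per-benchmark runtimes
# and plot the running sum on a log-scaled y axis.
import numpy as np
import matplotlib.pyplot as plt

runtimes = {  # tool name -> per-benchmark runtimes in seconds (made up)
    "tool-A": np.random.lognormal(0.0, 1.5, 200),
    "tool-B": np.random.lognormal(0.5, 1.0, 200),
}

for tool, times in runtimes.items():
    ordered = np.sort(times)  # benchmarks ordered by the tool's own runtime
    plt.plot(np.arange(1, len(ordered) + 1), np.cumsum(ordered), label=tool)

plt.yscale("log")
plt.xlabel("benchmarks (ordered by runtime of each tool)")
plt.ylabel("cumulative runtime [s]")
plt.legend()
plt.savefig("cactus.png")
```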

Runtime on parametric benchmarks

Models of runtime on parametric benchmarks, based on the specific parameter k, with a timeout of 60 s. The sawtooth shapes indicate that a tool failed on the benchmark for some k while solving the instances for k-1 and k+1. For brevity, we draw the models only until the tools start failing continually.

Contributing

We encourage researchers and practitioners to actively contribute to our ongoing efforts by expanding our results through the addition of new benchmarks and new tools, and by engaging with our open-source repositories. By introducing new benchmarks that encompass a wider range of scenarios and challenges, we can collectively advance the field and provide more comprehensive evaluations of the various tools and techniques. Our repositories are open for forking, enabling interested individuals to explore, modify, and build upon our codebase. If you are interested in contributing your own benchmark, we suggest the following:

  1. Fork our benchmark repository. Add your own benchmark in the .mata format (see the definition of the .mata format) or in the .afa format.
  2. Create a pull request from your fork. We will check your contribution, possibly request changes, and, once we are satisfied, merge the changes into our repository.
  3. Note that we will try our best to include your benchmarks in our experiments, but we cannot guarantee that we will have enough time/resources to do so, so please be considerate.
If you are interested in contributing your own tool, we suggest the following:
  1. Send us a repository with clear instructions on how to compile and run your tool.
  2. Since our environment is a virtual machine, you can potentially send us the whole machine; however, we prefer to check and run your tool ourselves (this might also give you additional feedback on your tool).
  3. We will process the results.
  4. Note that we might eventually provide a better way of integrating the tools and distributing the environment (e.g., through a Dockerfile or some other alternative). For now, however, we believe the virtual machine is sufficient.

Limitations and Future work

Threats to validity. Our results must be taken with a grain of salt, as the experiment contains inherent room for error. Although we tried to be as fair as possible, not knowing every tool intimately, the conversions between formats and kinds of automata, discussed at the start of Section 5, might have introduced biases into the experiment. The tools are written in different languages, and some have parameters which we might have used in a sub-optimal way (we use the tools in their default settings), or, in the case of libraries, we could have used a sub-optimal combination of functions. We also did not measure memory peaks, which could be especially interesting, e.g., when the tools are deployed in a cloud. We are, however, confident that our main conclusions are well justified and that the experiment gives a good overall picture. The entire experiment is available for anyone to challenge or improve upon.
We are currently working on the following extensions of our experimental evaluation:
  1. Enhancing our benchmarks and environment: Our current setup is still rather a prototype with a lot of obsolete code, steps, and scripts. We wish to bring the virtual machine, the benchmark repository, and the results repository into better shape.
  2. Enhancing the precision of the results: We mentioned some threats to the validity of our results above. We wish to improve the precision of, and our own trust in, the results. We think we might have been unfair to some tools, and, moreover, we are quite surprised by some results (e.g., we expected the AFA-emptiness implementation of the work of [D'Antoni18], in the tool that we refer to as Bisim, to be faster). We wish to explore other parameters of the tools, as well as to enhance the results by measuring at a higher level of granularity (i.e., including how much time parsing, preprocessing steps, or conversions between formats take).
  3. Comparing tools based on other metrics: We compared the tools based on runtime only. We believe that comparing the tools, e.g., with respect to the memory peak (i.e., the maximal amount of allocated memory), might provide additional insight and could be useful to those who wish to apply these tools and approaches in practice (e.g., in a cloud or on their own server); a minimal sketch of such a measurement is shown below this list.
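The following sketch illustrates one possible way to measure a tool's memory peak on Linux; it is only our assumption of how such a measurement could look (using the resource module's high-water mark for child processes), not part of our current harness, and the command shown is a placeholder.

```python
# Minimal sketch: measure the memory peak (maximal resident set size) of a
# tool invocation on Linux. Assumes this wrapper runs exactly one measured
# child process, so the children's high-water mark equals that child's peak.
import resource
import subprocess

def run_with_memory_peak(cmd):
    result = subprocess.run(cmd, capture_output=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss  # kB on Linux
    return result.returncode, peak_kb

if __name__ == "__main__":
    code, peak_kb = run_with_memory_peak(["echo", "hello"])  # placeholder command
    print(f"exit code {code}, memory peak ~ {peak_kb} kB")
```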

Contact Us

Acknowledgements

This work has been supported by the Czech Ministry of Education, Youth and Sports ERC.CZ project LL1908, and the FIT BUT internal project FIT-S-20-6427.

Get Our Benchmark